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Abstract 

We consider the problem of multivariate regression in a setting where the relevant predictors could be shared 
among different responses. We propose an algorithm which decomposes the coefficient matrix into the product 
of a long matrix and a wide matrix, with an elastic net penalty on the former and an t\ penalty on the latter. 
The first matrix linearly transforms the predictors to a set of latent factors, and the second one regresses 
the responses on these factors. Our algorithm simultaneously performs dimension reduction and coefficient 
estimation and automatically estimates the number of latent factors from the data. Our formulation results in a 
non-convex optimization problem, which despite its flexibility to impose effective low-dimensional structure, is 
difficult, or even impossible, to solve exactly in a reasonable time. We specify an optimization algorithm based 
on alternating minimization with three different sets of updates to solve this non-convex problem and provide 
theoretical results on its convergence and optimality. Finally, we demonstrate the effectiveness of our algorithm 
via experiments on simulated and real data. 
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1. Introduction 

Multivariate regression analysis, also known as multiple-output regression, is concerned with modelling the 
relationships between a set of real-valued output vectors, known as responses, and a set of real-valued input 
vectors, known as predictors or features. The multivariate responses are measured over the same set of predictors 
and are often correlated. Hence, the goal of multivariate regression is to exploit these dependencies to learn 
a predictive model of responses based on an observed set of input vectors paired with corresponding outputs. 
Multiple-output regression can also be seen as an instance of the problem of multi-task learning, where each task 
is defined as predicting individual responses based on the same set of predictors. The multivariate regression 
problem is encountered in numerous fields including finance [20], computational biology [23], geostatistics m , 
chemometrics 1451 . and neuroscience Ba¬ 
in this paper, we are interested in multivariate regression tasks where it is reasonable to believe that the 
responses are related to factors, each of which is a sparse linear combination of the predictors. Our model further 
assumes that the relationships between the factors and the responses are sparse. This type of structure occurs 
in a number of applications and we provide two examples in later sections. 

Given p-dimensional predictors x, = (xn ,..., Xi P ) T € R p and g-dimensional responses yi = (yu,..., yi q ) T € 
R q for the z-th sample, we assume there is a linear relationship between the inputs and outputs as follows: 

yi = D T Xi + e*, i = l,...,N, (1) 

where D pxg is the regression coefficient matrix and e, = (e,i,..., Ci q ) is the vector of errors for the z-th sample. 
We can combine these N equations into a single matrix formula: 

Y = XD + E, (2) 

where X denotes the nx p matrix of predictors with xR as its z-th row, Y denotes the nx q matrix of responses 
with yi T as its z-th row, and E denotes the n x q matrix of errors with e, T as its z-th row. For q = 1, this 
multivariate linear regression model reduces to the well-known, univariate linear regression model. 

We assume that the columns of X and Y are centred and hence the intercept terms are omitted and the 
columns of X are normalized. We also assume that the error vectors for N samples are iid Gaussian random 
vectors with zero mean and covariance E, i.e. e, ~ AF( 0 , E), i = 1 ,..., N. 




Iii the absence of additional structure, many standard procedures for solving ([ 2 ]), such as linear regression 
and principal component analysis, are not consistent unless p/ro — » 0. Thus, in a high-dimensional setting where 
p is comparable to or greater than n, we need to impose some low-dimensional structure on the coefficient 
matrix. For instance, element-wise sparsity can be imposed by constraining the £\ norm of the coefficient 
matrix, ||D||i ; i [551 HI] . This regularization is equivalent to solving q separate univariate lasso regressions for 
every response; thus we consider tasks separately. Another way to introduce sparsity is to consider the mixed 
£i/£-y norms (7 > 1). In this approach (sometimes called group lasso), the mixed norms impose a block- 
sparse structure where each row is either all zero or mostly zeros. Particular examples, among many other 
works, include results using the t\/£oo norm [551155] , and the £ 1/(2 norm @2] [25]. Also, there are the so-called 
“dirty” models which are superpositions of simpler low-dimensional structures such as element-wise and row-wise 
sparsity psim or sparsity and low rank mm- 

Another approach is to impose a constraint on the rank of the coefficient matrix. In this approach, instead 
of constraining the regression coefficients directly, we can apply penalty functions on the rank of D, its singular 
values and/or its singular vectors [2313 IQ; 46) ;3Tj. These algorithms belong to a broad family of dimension- 
reduction methods known as linear factor regression , where the responses are regressed on a set of factors 
achieved by a linear transformation of the predictors. The coefficient matrix is decomposed into two matrices: 
D = A pxm B mxg . Matrix A transforms the predictors into to latent factors, and matrix B determines the factor 
loadings. 

Our contributions 

Here, we propose a novel algorithm which performs sparse multivariate factor regression (SMFR). We jointly 
estimate matrices A and B by minimizing the mean-squared error, ||Y —XAB|||., with an elastic net penalty on 
A (which promotes grouping of correlated predictors and the interpretability of the factors) and an l\ penalty on 
B (which enhances the accuracy and interpretability of the predictions). We provide a formulation to estimate 
the number of effective latent factors, to. To the best of our knowledge, our work is the first to strive for 
low-dimensional structure by imposing sparsity on both factoring and loading matrices as well as the grouping 
of the correlated predictors. This can result in a set of interpretable factors and loadings with high predictive 
power; however, these benefits come at the cost of a non-convex objective function. Most current approaches for 
multivariate regression solve a convex problem (either through direct formulation or by relaxation of a non-convex 
problem) to impose low-dimensional structures on the coefficient matrix. Although non-convex formulations, 
such as the one introduced here, can be employed to achieve very effective representations in the context of 
multivariate regression, there are few theoretical performance guarantees for optimization schemes solving such 
problems. We formulate our problem in Section [2[ In Section [3j we propose an alternating minimization scheme 
with three sets of updates to solve our problem and provide theoretical guarantees for its convergence and 
optimality. We show that under mild conditions on the predictor matrix, every limit point of the minimization 
algorithm is a stationary point of the objective function and if the starting point is close enough to a local or 
global minimum, our algorithm converges to that point. Through analysis of simulations on synthetic datasets 
in Section [5] and two real-world datasets in Section [ 6 ] we show that compared to other multivariate regression 
algorithms, our proposed algorithm can provide a more effective representation of the data, resulting in a higher 
predictive power. 

Related Methods 

Many multivariate regression techniques impose a low-dimensional structure on the coefficient matrix. 
Element-wise sparsity, here noted as LASSO, is the most common approach where the cost function is de¬ 
fined as ||D|| 1,1 [HS1 [4T] . An extension of LASSO to the multivariate case is the row-wise sparsity with the fi /^2 
norm as the cost function: ||D|| 1>2 [371125]. Peng et al. proposed a method, called RemMap m , which imposes 
both element-wise and row-wise sparsity and solves the following problem: 

nun || Y - XD|||. + AiHDII,,! + A 2 ||D|| 1>2 - 

In an alternative approach, m extended the partial least squares (PLS) framework by imposing an additional 
sparsity constraint and proposed Sparse PLS (SPLS). 

Another common idea is to employ dimensionality reduction techniques to find the underlying latent structure 
in the data. One of the most basic algorithms in this class is an approach called Reduced Rank Regression 
(RRR) [150] where the sum-of-squares error is minimized under the constraint that rank(D) < r for some 
r < minjp, q}. It is easy to show that one can find a closed-form solution for D based on the singular value 
decomposition of Y T X(X T X) _ 1 X T Y. However, similar to least-squares, the solution of this problem without 
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appropriate regularization exhibits poor predictive performance and is not suitable for high-dimensional settings. 
Another popular approach is to use the trace norm as the penalty function: 

min{p,g} 

mm||Y-XD||| + A £ <jj (D), (3) 

i=i 

where ctj (D) denotes the j’th singular value of D. The trace norm regularization has been extensively studied 
in the literature [TB] 3J [13 SSb It imposes sparsity in the singular values of D and therefore, results in a 
low-dimensional solution (higher values of A correspond to achieving solutions of lower rank). 

Many papers study problems of the following form: 

min || Y — XD|| F + g( D) s.t. rank(D) < r < min{p, q}, (4) 

where g(D) is a regularization function over D. For instance, in [24j . a ridge penalty is proposed with g( D) = 
A||D|||. Often it is assumed that D = A pxr B rxg (and thus rank(A) < r, rank(B) < r, and consequently 
rank(D) < r) and the problem is formulated in terms of A and B. In [T9] . g( A, B) = Ai|| A ||+ A 2 1|B||^. An 
algorithm called Sparse Reduced Rank Regression (SRRR) is proposed in |8] and further studied in [2T], where 
g( A, B) = A||A|| 1; 2 with an extra constraint that BB J = I. In [3], g(A,B) = A||B|| 2 1 with an extra constraint 
that A 1 A = I, and it is assumed that p < q and r = p. Dimension reduction is achieved by the constraint on 
B which forces many rows to be zero which consequently cancels the effects of the corresponding columns in A. 

Our problem formulation differs in three important ways: (i) sparsity constraints are imposed on both A 
and B; (ii) the elastic net penalty enables us to control the level of sparsity for each matrix separately and also 
provides the grouping of correlated predictors; and (iii) the number of factors is determined directly, without 
the need for cross-validation. We will discuss the second and third aspects in detail in the next section. The first 
difference has substantial consequences; when decomposing the coefficient matrix into two matrices, the first 
matrix has the role of aggregating the input signals to form the latent factors and the second matrix performs a 
multivariate regression on these factors. Imposing sparsity on A enhances the variable selection as well as the 
interpretability of the achieved factors. Also, as originally motivated by LASSO, we would like to impose the 
sparsity constraint on B in order to improve the interpretability and prediction performance. 

Imposing sparsity on both A and B means that for our model to make sense for a specific problem, the 
outputs should be related in a sparse way to a common set of factors that are derived as a sparse combination of 
the inputs. For example, in analyzing the S&P 50lj^] stocks, it is well-known that returns exhibit much stronger 
correlations for companies that belong to the same industry sector. The memberships of each sector are not 
always so clear, because a company may have several diverse activities that generate revenue. So if we are using 
just concurrent stock returns to try to predict those of other companies, it is reasonable to assume that factors 
representing industry sectors should appear [9]. Since we do not expect the main sectors to overlap much, a 
company will not be present in many factors; so, it is reasonable to assume that A is sparse. Moreover, most 
companies will only be predicted by one or two such factors, so it makes sense that B is also sparse. 

There is a connection between Sparse PCA [53] and our algorithm, in the case where Y is replaced by X 
(i.e., X is regressed on itself). We investigate this relationship in Section |4j 

2. Problem Setup 

In this work, we introduce a novel low-dimensional structure where we decompose D into the product of 
two sparse matrices A pxm and B mxg where m < min (p,q). This decomposition can be interpreted as first 
identifying a set of to factors which are derived by some linear transformation of the predictors (through matrix 
A) and then identifying the transformed regression coefficient matrix B to estimate the responses from these to 
factors. We provide a framework to find to, the number of effective latent factors, as well as the transforming 
and regression matrices, A and B. For a fixed ?n, define: 

A m ,B m = argmin /(A, B), (5) 

Ap X m,B m xq 

where 

/(A,B) = i||Y-XAB|| 2 F + A 1 ||A|| lil + A 2 ||B|| l!l + A 3 ||A|||. (6) 


1 Standard and Poors index of 500 large-cap US equities http://ca.spindices.com/indices/equity/sp-500 
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Then, we solve the following optimization problem: 

m=max(m) < r s.t. rank(A m ) = rank(B m ) =m, (7) 

where r is a problem-specific bound on the number of factors. We then choose Ajj and B^ as solutions. Thus, 
we find the maximum number of factors such that the solution of ([5]) has full rank factor and loading matrices. 
In other words, we find the maximum m such that the best possible regularized reconstruction of responses, 
i.e., the solution of ([ 5 ]) , results in a model where the factors (columns of of A) and their contributions to the 
responses (rows of B) are linearly independent. To achieve this, we initialize A to have r columns, B to have r 
rows, and set m = r, solve the problem ([ 5 ]) ([b]), check for the full rank condition; if not satisfied, set m = m— 1, 
and repeat the process until we identify an m that satisfies the rank condition. 

In the remainder of this section, we first discuss the desirable properties resulting from the choice of our 
penalty function, and then explain the reasoning behind how we estimate to. 


2.1. Controlling Sparsity Levels in Matrices A and B Separately 
Consider the case where A3 = 0. We have: 

min/(A, B) = min |||Y - XAB||| + ArHAI^ + A 2 ||B|j lil 

A.B A,B Z 

= min h\Y - XA'B'H! + ^UA'Hqi + A 2 c||B'|| 1 , 1 

A',B',c Z C 

= min |||Y - XA'B'II^ + ^/a 1 A 2 ||A'|| 1 , 1 ||B'|| 1 , 1) 


( 8 ) 

(9) 

( 10 ) 


where 0 < c, A = A'/c and B' = cB. In problem (101, we do not have any separate control over the sparsity 
levels of A' and B'. If (A',B') is a solution to the last problem, then there is a c* for which (A'/c*,c*B) is 
a solution to the first problem. Therefore, we cannot control the sparsity levels of the two matrices by just 
including two norms. Incorporating the elastic net penalty for A resolves this issue immediately since the 
equivalence between optimization problems (|8| and (10) does not hold any more. 


2.2. Grouping of Correlated Features 

In this section, we show the z’th row of a matrix X by X.j. and its j’th column by X.j . Remember that 
matrix A has the role of combining the relevant features to form the latent factors which will be used later in 
the second layer by matrix B for estimating the outputs. If there are two highly correlated features we expect 
them to be grouped together in forming the factors. In other words, we expect them to be both present in a 
factor or both absent. Inspired by Theorem 1 in the original paper of Zou and Hastie on elastic net [523, we 
prove in this section that elastic net penalty enforces the grouping of correlated features in forming the factors. 

The columns of X correspond to different features. We assume that all columns of X are centred and 
normalized. Thus, the correlation between the i’th and the j’th features is pij = X^X.j. 

Lemma 1. Consider solving the following problem for given Ai and A 3 : 

A = argmin/(A, B) = argmin ^||Y - XAB||| + Ai||A|| m + A 3 || A||^. (11) 

a a ^ 


Then, if AikAjk > 0, we have: 


PA, ^ ^ /- 

||Y|H|B fc .|k |Alfc “ Ajfel “ ^ 2(1 " Plj) (12) 

This lemma says, for instance, that if the correlation between features i and j is really high (i.e., ~ 1), 

then the difference between their corresponding weights in forming the k 'th factor, |Aifc — A^l, would be very 
close to 0. If X..; and X.j are negatively correlated, we can state the same lemma for X..; and —X.j and use 
I Pij I- 


4 







Proof. The condition A ik Ajk > 0 means that A ik ^ 0, A j k / 0, and sign(A i / c ) = sign(Ajfc). So, we have: 


df 


dA lk 

df 


DA. 


jk 


A ifc =A,, 


■A .jk —-A, 


= (-X T YB T ) ?;fc + (X r XABB T ) !t + A lS ign(A lfc ) + 2A 3 A 


ik 


= —(X. i ) r (YB T ). fe + (X.j) T (XABB T ). fc + A lS ign(A ifc ) + 2A 3 A ifc 
= (-X T YB T ), fc + (X T XABB T ) jfc + Aisign(A jfe ) + 2A 3 A jfc 


= -(X./(YB T ). t + (X,) T (XABB T ). t + A 1 sign(Aj fc ) + 2A 3 A )fe 

Due to the optimality of A, both derivatives are equal to zero. Equating the two equations, we get: 

2A 3 (Ajfc - A jk ) = (X.i - X.j) T (Y - XAB)(B fc .f 


(13) 

(14) 

(15) 

(16) 

(17) 


Thus, 

2A 3 |Ajfc - Aj k \ = KX.* - X.j) T (Y - XAB)(B,.) t | (18) 

< ||X.i - X.jIIp’HY - XAB|| F ||B fc .|| F (19) 

Since the columns of X are normalized, we have ||X.j — X.j|| F = ^/2(1 — Pij). Also, by definition, we have: 

/(A, B) < /(0,B) => ||Y - XAB|||. + \i\\A\\i,i + A 3 ||A|| M < ||Y||| =► ||Y - XAB|| F < ||Y|| F (20) 

Combining all these together gives the desired result. □ 


2.3. Estimating the number of effective factors 

In choosing m , we want to avoid both overfitting (large m ) and lack of sufficient learning power (small m). 
In general, we only require m < min(p, q); however, in practical settings where p and q are very large, we impose 
an upper bound on m to have a reasonable number of factors and avoid overfitting. This upper bound, denoted 
by r, is problem-specific and should be chosen by the programmer. For instance, continuing our example of 
analysing the S&P 500 stocks, most economists identify 10-15 primary financial sectors, so a choice of r = 15 
or 20 is reasonable since we expect the factors representing industries to appear, but we allow the algorithm to 
find the optimal m from the data. In order to have the maximum learning power, we find the maximum m < r 
for which the solutions satisfy our rank conditions. This motivates starting with m = r and decreasing it until 
the conditions hold (as opposed to starting with m = 1 and increasing the dimension). 

The full rank conditions are employed to guarantee a good estimate of the number of “effective” factors. An 
effective factor explains some aspect of the response data but cannot be constructed as a linear combination of 
other factors (otherwise it is superfluous). We therefore require the estimated factors to be linearly independent. 
In addition, we require that the rows of B, which determine how the factors affect the responses, are linearly 
independent. If we do not have this latter independence, we could reduce the number of factors and still obtain 
the same relationship matrix D, so at least one of the factors is superfluous. By enforcing that A and B are full 
rank, we make sure that the estimated factors are linearly independent in both senses, and thus m is a good 
estimate of the number of effective factors. 

In estimating the number of factors, we differ fundamentally from the common approach in literature. Setting 
aside the differences in the choice of regularization, most algorithms minimize a cost function as in (|5) for a 
fixed m without the rank condition in |7|. They then use cross-validation to find the optimal m [46l l3lf 0123. 
The criterion in choosing m is thus the cross-validation error. In contrast, we strive to find the largest m such 
that the solutions of (|5| have full rank and the estimated factors are linearly independent. Finding fh via 
cross-validation may result in non-full rank solutions with linearly dependent factors. Therefore, some factors 
can be expressed as linear combinations of other factors and can be viewed as redundant. Including redundant 
factors (via a non-full rank A) could help to improve the sparsity of B, but our goal is to have the minimum 
necessary set of factors, not to have a sparse B at any cost. Later, with experiments on synthetic and real data, 
we show that our approach towards choosing the number of factors results in better predictive performance as 
well as more interpretable factors compared to other techniques that apply cross-validation. 
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3. Optimization Technique and Theoretical Results 

The optimization problem defined in ([5][7| is not a convex problem and it is difficult, if not impossible, to 
solve exactly (i.e., to find the global optimum) in polynomial time. Therefore, we have to employ heuristic 
algorithms, which may or may not converge to a stationary solution [29]. In this section, we propose an 
alternating minimization algorithm with three different sets of updates, and provide theoretical results for each 
of them. 

3.1. Optimization 

For a fixed to, the objective function in ([5]) is biconvex in A and B; it is not convex in general, but is convex 
if either A or B is fixed. Let us define C = (A, B). To solve ([5) for a fixed to, we perform Algorithm 1, with 
an arbitrary, non-zero starting value Co = (Ao,Bo) (see Section 5] for a discussion on the choice of the starting 
point). The stopping criterion is related to the convergence of the value of function /, not the convergence of 
its arguments. In our experiments, we assume / has converged, if < e where the default value of the 

tolerance parameter, e, is IE — 5; i.e., the algorithm stops if the relative changes in / are less than 0.001%. 

Algorithm 1 Solving problem ([5]) for fixed to 

A i — Aq, B i — Bo, i t — 0 

while stopping criterion not satisfied do 

Bj+i t— update B with A fixed at A, 

Aj + i«— update A with B fixed at Bj+i 


( 21 ) 

( 22 ) 


*■<—* + 1 

end while 

A, B ■<— values of A and B at convergence 


We consider three different types of updates: 
• Basic updates: 


Bj+r <- argmin i||Y - XAjB|||. + A 2 ||B|| 1>1 

B 2 

A,-|-i «— argmin —1|Y — XAB^i||^ + A 3 II A||j^ + Ai||A||ip 
A 2 


(23) 

(24) 






Proximal updates: 

B i+1 <- argmin J||Y - XA.,B\\ 2 F + A 2 ||B|| M + ft||B - (25) 

B 2 

Aj+i <— argmin —1|Y — XAB,_|_i||^ + A 3 ||Aj|^-|-Ai||A|ji.i + Oi|| A — A,||p> (26) 

A 2 

where a m i n < at < a max and /3 m j n < /3; < fimax', be., they are bounded from both sides. 

Prox-linear updates: 


Bi+i <— S\ 2 /p i (Bi — g(Ai, Bj)//3j) 
i S\ 1 / a .(Ai h(A^, B^-j-i)/^^) 

where S is the soft-thresholding function, S T (u) = sign(^) max(|i/| — r, 0), and: 


g( a,b) 

h(A, B) 


Jg Q||Y - XAB||^ = -A t X t Y + A t X t XAB 

Q||Y-XAB||| + A 3 ||A||^ = -X t YB t + X t XABB t + 2 A 3 A 


(27) 

(28) 


(29) 

(30) 
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are the derivatives of /(A, B) without the i\ penalties. Also, a,; and /3, are multipliers that have to be 
greater than or equal to the Lipschitz constants of /i(A,B; + i) and g(A, , B) respectively. Since: 


lls^Aj, B) — g(A i: B')|| F = |j AfX T XA i: (B - B')\\ F < HAfX^XA^HIB - B'|| F (31) 

||/i(A,B j+1 )-/i(A',B i+1 )|| F = ||X T X(A-A , )B i+1 Bf +1 +2A 3 (A-A , )|| F (32) 

< (||XX T || F ||B i+1 Bf +1 || F + 2A 3 )|| A - A'|| f , (33) 

we can set 

Pi = ||AfX T XAj|| F (34) 

= ||XX T || F ||B. i+1 Bf +1 || F + 2A 3 (35) 


Finally: 

Bj = B j + ujf (B,; — Bj_i) 
A; = A i + uif(Ai - A^_i) 
, wf £ [0,1) 


(36) 

(37) 

(38) 


are extrapolations which are used to accelerate the convergence. (See Section 4.3 of Pj3] for more details.) 
For our algorithm, we choose 


A ■ fti — 1 1 r \J a i— 1. 

= mm(--- ,d u — 

v 


w? 


• 1 1 r \J — 1 

, L — 


); ti — (1 + y i + 4tf_ ,)/2, t D 


= F 


for some S u < 1. The inequalities uif- < and uif < ' V ^A 1 are necessary for establishing convergence. 

The prox-linear updates need an extra step. If /(A i+1 , B i+1 ) > /(A, , B,), i.e., no descent, the two 
optimization steps are repeated with no extrapolation (coi = 0). This is also necessary for convergence. 


3.2. Theoretical Results 

The theoretical aspects of the alternating minimization (also known as block coordinate descent) algorithms 
have been extensively studied in a wide range of settings with different convexity and differentiability assump¬ 
tions pans edem ii 001 ill]- a full review of this vast literature is beyond the scope of this paper. Here, 
we focus on the recent work of Xu and Yin |44| which considers a setting that matches the problem we study 
in this paper (block multi-convex objective with non-smooth penalty). Xu and Yin show that the solutions of 
proximal and prox-linear updates described above converge to a stationary point of the objective function and 
if the starting point is close to a global minimum, the alternating scheme converges globally. This conclusion is 
possible because the objective function in our problem is the sum of a real analytic function and a semi-algebraic 
function and satisfies the Kurdyka-Lojasiewicz (KL) property. However, their results on convergence do not 
apply to the basic updates, because although the objective ||Y — XABH^ is convex in B, it is not strongly 
convex. In this section, we first present the results for proximal and prox-linear updates from |44| and then 
provide some novel results for the basic update. 


3.2.1. Definitions and convergence of the objective 
Definition 1. C* = (A*,B*) is called a partial optimum of / if 

/(A*,B*) < /(A*,B), VBeP ,x « (39) 

and /(A*,B*) < /(A,B*), VAeR nxm . (40) 

Note that a stationary point of the objective is a partial optimum but the reverse does not generally hold; 
see Section 4 of [42j for an example. 

Definition 2. A point C* is an accumulation point or a limit point of a sequence {Ci}j 6 N, if for any neighbour¬ 
hood V of C*, there are infinitely many j £ N such that C j £ V. Equivalently, C* is the limit of a subsequence 

°f {CijigN- 

Proposition 1. The sequence /(Aj,B,) generated by Algorithm 1 converges monotonically. 
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The value of / is always positive and is reduced in each of the two main steps of Algorithm 1. Thus, it is 
guaranteed that the stopping criterion of Algorithm 1 will be reached. 


3.2.2. Proximal and prox-linear updates 

Since the sequence of solutions generated by the alternating minimization stays in a bounded, closed, and 
hence compact set, it has at least one accumulation point. Assuming that the parameters for the updates are 
chosen as explained above, we can state the following theorem about the accumulation points of the sequence 
of solutions. 


Theorem 1. (Theorem 2.3 from W- For a given starting point Co = (Ao,Bo), let {Cijigpj denote the se¬ 
quence of solutions generated by proximal or prox-linear updates. Then, any accumulation points of the sequence 
C i is a partial minimum of f. 

The next theorem states that under some assumptions, if the sequence C, has a finite accumulation point, 
it converges to that point. 


Theorem 2. (Theorem 2.8 from W- Fov cl given starting point CJq — (.A.o, let {CJ denote the 
sequence of solutions generated by proximal or prox-linear updates and assume that it has a finite accumulation 
point C*. Assuming that V(|||Y — XAB|j|. + A 3 IIA|j|,) is Lipschitz continuous and f satisfies the Kurdyka- 
Lojasiewicz (KL) property at C*„ {Ci}.; e N converges to C*. 

Both of these conditions are held. From (311 (33) we get the Lipschitz continuity. Also since / is the sum 
of real analytic and semi-algebraic functions, it satisfies the KL property (see Section 2.2 of [44] for details of 
the KL property). 

Finally, it is shown in II 441 Lemma 2.6] that if the staring point is sufficiently close to a global minimizer of 
/, then the sequence of solutions will converge to that point. 


3.2.3. Basic updates 

The results of the previous section do not hold for the basic updates because the objective is not strongly 
convex in B. In this part, we provide some novel results for the basic update with the proofs in the Appendix. 


Theorem 3. If the entries ofiX.£ R nxp are drawn from a continuous probability distribution on K rap , then: 

(i) The solution of (23) is unique if Aj is full rank. 

(ii) The objective of is strongly convex and its solution, if one exists, is unique. 

In classical LASSO, the condition on the entries of X is sufficient to achieve solution uniqueness [JBi. For 
LASSO, the continuity is used to argue that the columns of X are in general position with probability 1 (see 
Appendix B Definition [ 3 ] for a formal definition of general position). The affine span of the columns of X, 
{X 1 ,...,X fc+1 }, has Lebesgue measure 0 in R" for a continuous distribution on R n , so there is zero probability 
of drawing Xk +2 in their span. If we multiply X by a matrix A with full column rank, we retain the same 
property, and thus the solution of (23) is unique if A,; is full rank. Also, since the objective function is strictly 
convex in A (due to the elastic net property), if (241 has a solution, its solution is unique. Next, we study the 
properties of C = (A, B) at convergence. 


Theorem 4. Assume that the entries ofH.G R raxp are drawn from a continuous probability distribution on R” p . 
For a given starting point A 0 , let {C;} i6 pj denote the sequence of solutions generated by Algorithm 1. Then: 

(i) {Ci}i £ N has at least one accumulation point. 

(ii) All the accumulation points of {C;}i £ N are partial optima and have the same function value. 

(Hi) If B is full rank for all accumulation points of {C;}igN, then: 


lim ||C i+1 -Ci|| =0, (41) 

i—t 00 


Part (i) follows from the fact that the solutions produced by Algorithm 1 are contained in a bounded, closed 
(and hence compact) set. Although Algorithm 1 converges to a specific value of /, this value can be achieved by 
different values of C. Thus, the sequence C,; can have many accumulation points. Part (ii) of Theorem [4] shows 
that any accumulation point is a partial optimum. Proposition [l] implies that for any given starting point, all 
the associated accumulation points have the same / value. Under the assumption that B is full rank for all 
accumulation points of {C,}j e pj, part (iii) provides a guarantee that the difference between successive solutions 
of the algorithm converges to zero, for both the factor and loading matrices. Although the condition in (41) 
does not guarantee the convergence of the sequence {C^}i e pj, it is close enough for practical purposes. Also, 
note that for finding the number of factors, we require both A and B to be full rank for the final solution. When 
B is full rank, the solutions to both (23) and (24) arc unique and thus A and B will not change in the following 
iterations, i.e., convergence. 
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3.3. Complete Algorithm 

Now, we propose Algorithm 2 to solve the optimization problem described in ([5]|7| to find the number of 
latent factors as well as the factor and loading matrices. Optimal values of Ai, A 2 , A 3 are found via 5-fold cross- 
validation. Although the performance of our algorithm with different updates is roughly similar, our simulations 
show that the prox-linear updates provide the best results both in terms of prediction accuracy and speed of 
convergence. This could be due to the local extrapolation which helps the algorithm avoid small neighbourhoods 
around certain local minima. In the following, we present results for the prox-linear updates. 


Algorithm 2 Sparse Multivariate Factor Regression (SMFR) via Alternating Minimization 

l: Input: Training Set X„ xp ,Y„xo^ Ai, A 2 
2 : Output: Solution of problem ([ 5 J|7J) : A,B,fn 

3: m i — r > r: upper bound on the number of factors 

4: while true do 

5: A <— A 0 G R pxm , i <- 0 

6 : while value of /(A, B) not converged do 

7: B j+i 4— update B with A fixed at A, 

8 : A i+1 «—update A with B fixed at B i+1 

9: *■<—* +1 

10: end while 

11 : A, B •<— values of A and B at convergence 

12 : if rank(A) < to or rank(B) < m then 

13: mi — to — I 

14: else 

15: break 

16: end if 

17: end while 


4. Fully Sparse PCA 

In S3], Zou, Hastie, and Tibshirani propose a sparse PCA, arguing that in regular PCA “each principal 
component is a linear combination of all the original variables, thus it is difficult to interpret the results”. 
Assume that we have a data matrix X nxp with the following SV decomposition: X = UDV T . The principal 
components are defined as Z = UD with the corresponding columns of V as the loadings. In [5T], Zou et al. 
show that solving the following optimization problem leads to exact PCA: 

(A,B) = argmin||X-XAB|||, + AV||A i ||| s.t. BB T =I, 

A B 

’ % 

where A, denotes the i’th column of A. Thus, we have A j oc Vj, and XA, corresponds to the i’th principal 
component. Sparse PCA (SPCA) is introduced by adding an l\ penalty on matrix A to the objective function. 
Zou et al. propose an alternating minimization scheme to solve this problem |53l . 

The ordinary principal components are uncorrelated and their loadings are orthogonal. SPCA imposes 
sparsity on the construction of the principal components. Here sparsity means that each component is a 
combination of only a few of the variables. By enforcing sparsity, the principal components become correlated 
and the loadings are no longer orthogonal. On the other hand, SPCA assumes, like regular PCA, that the 
contributions of these components are orthonormal (BB 2 =1). In our algorithm, if we replace Y with X, i.e., 
regressing X on itself, we get a similar algorithm. However, our work differs in two ways. First, we also impose 
sparsity on the contributions of principal components. This comes at the expense of higher computational costs, 
but results in more interpretable results. Also, the contributions will not be orthonormal anymore. However, 
by the full rank constraint we impose on the two matrices, we make sure that the principal components and 
their contributions are linearly independent. Moreover, the algorithm provides a mechanism for learning from 
the data how many principal components are sufficient to explain the data. 
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5. Simulation Study 

In this section, we use synthetic data to compare the performance of our algorithm with several related 
multivariate regression methods reviewed in the introduction. 


5.1. Simulation Setup 

We generate the synthetic data in accordance with the model described in (2]), Y = XD + E, where D = AB. 
First, we generate an ?zxp predictor matrix, X, with rows independently drawn from A/”(0, Sj), where the (z, j)- 
th element of Sj is defined as crfj = . This is a common model for predictors in the literature |"Ui Tf 

The rows of the n x q error matrix are sampled from A/"(0, S^v), where the (z, j)-th element of is defined as 
= a n ’ 0.4b' *1. The value of cr^ is varied to attain different levels of signal to noise ratio (SNR). Each row 
of the pxro matrix A is chosen by first randomly selecting mo of its elements and sampling them from 7V"(0,1) 
and then setting the rest of its elements to zero. Finally, we generate the m x q matrix B by the element-wise 
product of B = U o W, where the elements of U are drawn independently from Af(0, 1) and elements of W are 
drawn from Bernoulli distribution with success probability s. 

We evaluate the performance of a given algorithm with three different metrics. We evaluate the predictive 
performance over a test set (X iest , Y test ), separate from the training set, in terms of the mean-squared error: 


MSE = 


||X test D-Y fest ||| 

nq 


(42) 


where D is the estimated coefficient matrix. In our case, we have D = AB. We also compare different algorithms 
based on their signed sensitivity and specificity of the support recognition: 


Signed Sensitivity = 
Specificity = 


Sij ' di,j > 0] 

Si,7 7^ 0] 

Sij = 0] ' M^i,j = 0] 
Six = 0] 


where 1 represents the indicator function. 

We compare the performance of our algorithm, SMFR, with many other algorithms reviewed in Section [l] 
as well as a baseline algorithm with a simple ridge penalty. We consider three different regimes: (i) high¬ 
dimensional problems with few instances (50) compared to the number of predictors or responses (50, 100, 
or 150); (ii) problems with increased number of instances (p,q < n); and (iii) problems where the structural 
assumption of our technique is violated. In the first regime, which is of most interest to us due to high- 
dimensionality, we explore different parameter settings. The values of a n and s affect the SNR lower values of 
a n and higher values of s correspond to higher values of SNR (e.g., cr„ = 5, s = 0.1 corresponds to a very low 
SNR). In regime (iii), we violate the assumption about the structure of the coefficient matrix, i.e., D = AB, in 
two ways. In the first case, D has an element-wise sparsity with a density of 20%; in the second, it has row-wise 
sparsity where 60% of the rows are all zeros and the rest have 30% non-zero elements. We consider these cases 
to compare our algorithm with others in an unfavourable setting. 

5.2. Results 

Predictive performance. The means and standard deviations of MSE for different algorithms are presented in 
Table]]] We use five-fold cross-validation to find the tuning parameters of all algorithms. We set r, the maximum 
number of factors, to 20. For the first two simulation regimes, our algorithm outperforms the other algorithms 
and results in lower MSE means and standard deviation. The improvements are more significant in the high¬ 
dimensional settings with high SNR. However, in settings with low SNR (<r„ = 5 ,s = 0.1) or high number of 
instances (n = 500), we still observe lower errors for SMFR. On average, our algorithm reduces the test error 
by 13.2% compared with LASSO, 21.4% compared with 12.3% compared with SRRR, 15.2% compared 

with RernMap, 19.4% compared with SPLS, 39.1% compared with Trace, and 16.7% compared with Ridge. 

In the last two simulations, where the assumed factor structure is abandoned, our algorithm has no advantage 
over simpler methods with no factor structure (such as LASSO) and gives a higher error. 
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Parameters MSE over test set 


n 

P 

q 

m 

m 0 


s 

SMFR 

LASSO 

U /l 2 US] SRRR [H] RemMap [2Z] 

SPLS HBJ 

Trace [T7] 

Ridge 

50 

150 

50 

10 

1 


0.2 

0.070 

0.083 

0.090 

0.084 

0.083 

0.091 

0.088 

0.089 

o 

(0.004) 

(0.005) 

(0.005) 

(0.005) 

(0.005) 

(0.007) 

(0.004) 

(0.004) 




10 

1 


0.4 

0.078 

0.104 

0.105 

0.099 

0.104 

0.110 

0.110 

0.111 




o 

(0.007) 

(0.008) 

(0.007) 

(0.007) 

(0.008) 

(0.008) 

(0.007) 

(0.006) 




10 

1 


0.2 

0.110 

0.118 

0.133 

0.117 

0.123 

0.122 

0.115 

0.122 




0 

(0.004) 

(0.005) 

(0.004) 

(0.005) 

(0.004) 

(0.007) 

(0.005) 

(0.006) 




15 



0.2 

0.071 

0.108 

0.112 

0.109 

0.107 

0.114 

0.109 

0.110 




z 

o 

(0.003) 

(0.006) 

(0.007) 

(0.008) 

(0.006) 

(0.008) 

(0.006) 

(0.008) 

50 

100 

100 

10 

1 


0.1 

0.068 

0.070 

0.092 

0.071 

0.075 

0.073 

0.071 

0.074 

5 

(0.001) 

(0.002) 

(0.002) 

(0.002) 

(0.002) 

(0.002) 

(0.002) 

(0.002) 

500 

150 

50 

10 

1 


0.2 

0.0172 

0.0180 

0.0198 

0.0176 

0.0184 

0.0216 

0.0183 

0.0187 

o 

(0.0001) 

(0.0001) 

(0.0001) 

(0.0001) 

(0.0001) 

(0.0007) 

(0.0002) 

(0.0001) 

500 

100 

100 

10 

1 


0.3 

0.0202 

0.0209 

0.0222 

0.0204 

0.0214 

0.0222 

0.0208 

0.0213 

0 

(0.0001) 

(0.0002) 

(0.0001) 

(0.0001) 

(0.0001) 

(0.0003) 

(0.0001) 

(0.0002) 

50 

100 

100 

element-wise 



0.079 

0.078 

0.096 

0.080 

0.085 

0.081 

0.078 

0.081 


sparsity 

5 


(0.001) 

(0.001) 

(0.002) 

(0.001) 

(0.001) 

(0.001) 

(0.002) 

(0.001) 

50 

150 

50 

row-wise 



0.082 

0.076 

0.080 

0.079 

0.075 

0.083 

0.081 

0.102 


sparsity 

3 


(0.004) 

(0.003) 

(0.003) 

(0.003) 

(0.003) 

(0.003) 

(0.001) 

(0.004) 


Table 1: Comparison of six algorithms for different setups. We report mean and standard deviations of the MSE 
over the test sets, based on 20 simulation runs. 


Variable selection. In Figure [l] we compare the average signed sensitivity and specificity of different algorithms 
(based on 20 simulation runs) as the number of instances increase (other parameters kept fixed). We observe 
that our algorithm has higher sensitivity and specificity. This effect for specificity is reduced as the number of 
instances increases. This shows that our algorithm is more advantageous in high-dimensional settings where the 
number of instances is comparable to or less than the number of predictors/responses. Although we only show 
the plots for a specific parameter setting, the results are similar for other parameters. 



Figure 1: Sensitivity and specificity comparison of different algorithms as the number of instances increases. 
(p = 150, q = 50, m = 10, mo = 1, a n = 3, s = 0.3) 

Number of latent factors. In Table [2j we compare the number of estimated factors for the three algorithms that 
perform dimensionality reduction. For SRRR and SPLS, we find the number of factors by 5-fold cross-validation. 
The true number of factors is mentioned in the fourth column. We observe that our algorithm provides better 
estimates of the number of factors. 

Computation time. We also compare the computation time of different algorithms for the parameter settings 
corresponding to the first row of Table [I] We report the median computation time (on a PC with 16GB RAM 
and quad-core CPU at 3.4GHz) of each algorithm (excluding the cross-validation part) over 20 runs; SMFR (with 
prox-linear updates): 1.6s, SRRR: 5.6s, RemMap: 0.3s, SPLS: 0.3s, LASSO: 0.02s, and ti/tz'- 0.33s. Since the 
algorithms are implemented using different programming languages and stopping criteria vary slightly, care must 
be taken when interpreting these results. The main message is that SMFR and SRRR have larger computation 
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n 

V 

q 

m m o 


s 

SMFR 

SRRR 

SPLS 

50 

150 

50 

10 

1 

3 

0.2 

10 , 

10.3, 1.1 

8, 8.5, 1.4 

14, 

13.7, 3.1 

50 

150 

50 

10 

1 

3 

0.4 

10, 

10.4, 1.3 

10, 9.4, 1.2 

18, 

16.6, 3.1 

50 

150 

50 

10 

1 

5 

0.2 

11, 

11.1, 1.8 

7, 6.7, 2.4 

7, 

8.2, 2.9 

50 

150 

50 

15 

2 

3 

0.2 

14, 

13.5, 1.1 

13, 12.6, 2.3 

19, 

19.1, 1.1 

50 

100 

100 

10 

1 

5 

0.1 

7, 

7.3, 2.8 

6, 6.1, 3.0 

4, 

4.4, 1.3 

500 

150 

50 

10 

1 

3 

0.2 

10 

, 10, 0.0 

10, 9.9, 0.2 

15 

15, 0.0 

500 

100 

100 

10 

1 

5 

0.3 

10, 

10.1, 0.9 

10, 9.9, 0.2 

15 

15, 0.0 


Table 2: Median, mean, and standard deviation of estimated number of factors (based on 20 runs). 


times, since they solve more complicated problems which involve estimating the number of factors, and the 
loading and factoring matrices. This extra information has value in itself and provides more insight about the 
structure of data. Also, SMFR is almost three times faster than SRRR. 

SMFR initialization. In all the experiments above, and the ones in the next section, we use a random initial¬ 
ization for our algorithm where the elements of Ao are sampled independently from A/”(0,1). We compare this 
initialization with two other methods which are based on the matrix factorization of other solutions: (i) using 
the SVD of LASSO solution, Dlasso = USV T , setting A 0 = Ui :m Si :m and B 0 = (V T )i :m , where the subscript 
1 :m indicates choosing the first m columns; and (ii) using the SVD of trace-norm solution, Dxrace = USV T , 
setting A 0 = Ui ;m Si ;m and B 0 = (V T )i :m . We compare these three initializations for the case where the data 
is generated according to the first regime of Table 1 and vary the level of noise and sparsity. The results are 
shown in Table [3j The procedure to find the number of effective factors is the same for all three initializations. 
Also, as a baseline comparison, we show the results for the usual LASSO regression in the last row. The results 
reported in Table [3] are based on 20 runs (i.e., 20 different realizations of the data). It would also be interesting 
to examine how the results change based on different random initializations for a fixed realization of the data. 
In Table |4j we show the results for 5 realizations of the data. For each realization, we run our algorithm with 
20 different random initializations and report the mean and standard deviation of MSE over the test set. For 
comparison, we also include the MSE achieved by the decomposition of LASSO and trace norm solutions. 

The results show that our algorithm is very robust to the choice of the starting point. The results in both 
tables show that random initialization gives similar (or slightly better) results compared to more sophisticated 
initializations. Moreover, the very small standard deviations in the first row of Tableland the similarity of the 
error for all three types of initializations show that for a given realization of the data, different initializations 
lead to very similar estimates by our algorithm - another indication of the robustness of our algorithm to the 
choice of the starting point. 



a n = 3, s = 0.2 

a n = 3, s = 0.3 

a n = 5, s = 0.2 

a n = 5 , s = 0.3 

Random Initialization 

0.070 (0.003) 

0.075 (0.003) 

0.110 (0.005) 

0.116 (0.005) 

LASSO Initialization 

0.071 (0.003) 

0.076 (0.004) 

0.112 (0.004) 

0.117 (0.004) 

Trace Norm Initialization 

0.071 (0.004) 

0.076 (0.004) 

0.113 (0.004) 

0.119 (0.006) 

LASSO 

0.085 (0.004) 

0.097 (0.005) 

0.119 (0.005) 

0.135 (0.007) 


Table 3: Comparing different initialization techniques. (p = 150, q = 50, n = 50, and m = 10) 


Initialization 

Instance #1 

Instance #2 

Instance #3 

Instance #4 

Instance #5 

Random 

0.071 (0.0004) 

0.069 (0.001) 

0.065 (0.0004) 

0.070 (0.0005) 

0.071 (0.0005) 

LASSO 

0.071 

0.070 

0.065 

0.070 

0.072 

Trace Norm 

0.071 

0.070 

0.065 

0.071 

0.071 


Table 4: MSE variation over 20 different random initializations for the same problem (first row of Table 1). 


6. Application to Real Data 

We apply the proposed algorithm to real-world datasets and show that it exhibits better or similar predictive 
performance compared to state-of-the-art algorithms. We show that the factoring identified by our algorithm 
provides valuable insight into the underlying structure of the datasets. 
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6.1. Montreal’s bicycle sharing system (BIXI) 

The first dataset we consider provides information about Montreal’s bicycle sharing system called BIXI. 
The data contains the number of available bikes in each of the 400 installed stations for every minute. We 
use the data collected for the first four weeks of June 2012. From this dataset we form the set of predictors 
and responses as follows. We allocate two features to each station corresponding to the number of arrivals 
and departures of bikes to or from that station for every hour. The learning task is to predict the number of 
arrivals and departures for all the stations from the number of arrivals and departures in the last hour (i.e., 
a vector autoregressive model). The choice of this model is a compromise between accuracy and complexity. 
Mathematically, we want to estimate D such that Y ~ XD, where X tJ and X ti 4 oo+j respectively show the 
number of arrivals and departures in hour t at station j and Y t j and Y tj 4 oo+j respectively show the number of 
arrivals and departures in hour t + 1 at station j. 

We perform the prediction task on each of the four weeks. For each week, we take the data for the first 5 
days (120 data points; Friday to Tuesday) as the training set (with the fifth day data as the validation set), 
and the last two days as the test set (48 data points; Wednesday and Thursday). We compare the algorithms 
performing dimensionality reduction in terms of their predictive performance on the test sets and the number 
of chosen factors in Table [5j We also include LASSO and £ 1/^2 as baseline algorithms. To avoid showing 
very small numbers, we present the value of MSE, defined in ( |42| ), times nq. In terms of the prediction per¬ 
formance, we observe that our algorithm outperforms the others in all 4 weeks. We can also compare the 
algorithms in terms of the number of chosen factors. For our algorithm, we set r = 15 (the upper bound on 
the number of factors), and for the others, we do the cross-validation for 1 to 15 factors (15 different values). 
SMFR always uses all the 15 factors while the other two algorithms choose fewer factors (using cross-validation). 


week 


SMFR 

SRRR 

SPLS 

LASSO 

h/h 

1 

error 

557.3 

570.0 

1661 

580.4 

591.0 

factors 

15 

3 

7 

— 

— 

2 

error 

570.1 

602.2 

1888 

610.9 

623.7 

factors 

15 

3 

6 

— 

— 

3 

error 

618.8 

641.9 

2159 

643.4 

657.8 

factors 

15 

7 

5 

— 

— 

4 

error 

549.7 

594.6 

1621 

594.9 

588.0 


factors 

15 

3 

6 

— 

— 


Table 5: Total squared error (MSExng) and the number of factors for BIXI dataset 

To gain more insight into the quality of predictions, we randomly choose two features in week 4 and plot the 
predictions made by different algorithms over the test set in Figure [2j the results for other features are similar. 
The y-axis represents the cumulative number of bikes. We observe that SMFR provides a better fit to the data. 



Figure 2: Comparing fit of different algorithms for two sample cumulative features. Proposed algorithm performs 
better than others for most stations. 


We are using the data from the weekends as well as three weekdays to learn the prediction model, so it 
is reasonable to ask whether it is sensible to combine weekends and weekdays. Our preliminary data analysis 
indicated that the data on weekends are not particularly different from the other days, and thus we can include 
weekends in our training sets. For example, the correlation between the number of arrivals on weekends and the 
number of arrivals on weekdays across stations, is very high (around 0.9) for all four weeks, showing that the 

13 
























relative level of activity in a station (compared to other stations) is similar for weekends and weekdays; i.e, if 
a station has relatively high number of arrivals on the weekdays, it is highly likely that it will have a relatively 
high number of arrivals on the weekends too. We have repeated the experiment after removing the weekends 
and obtained similar results. 

To further investigate the variable selection of our algorithm, we run it on the whole data and examine 
the resulting factors. Three of these are shown in Figure [3] In these figures, all the bike stations are shown 
with green plus signs. For each factor, we show its constituent features; red crosses and blue circles correspond 
respectively to the departure and arrival features of each station that are present in that factor. Examining these 
factors provide useful insight into the data. For instance, the factor in Figure [3(a)| shows that the departures 
from populated residential areas (The Plateau, Mile End, Outremont) and arrivals at downtown (Ville Marie) 
are combined together to form a factor. This agrees with the intuition that many people are taking bikes to go 
from their homes to downtown where universities and businesses are located. The factor in Figure |3(b)| shows 
another strong effect which corresponds to the flow from the peripheries of downtown to more central locations 
(Place des Arts, Old Port). Many hotels and several universities are situated at the edge of downtown; numerous 
restaurants, cafes and tourist sites are located in the centre, and several festivals occurred there during June. 
The third factor represents the flow within this central part. 

Our model is expected to provide a better fit to the BIXI data since its assumed structure matches the 
underlying structure of bike movements. It is expected that generally, people ride from certain areas of the city 
to others. The factors capture this aggregated movement (similar to wavelets in the sense of smoothing over 
multiple individual sites). We expect the factors (matrix A) to be sparse because they capture movement from 
one region to another, and any stations outside these regions do not participate. We expect the loadings (matrix 
B) to be sparse because each station should only be predicted by those factors that involve it in terms of arrivals 
or departures. Therefore, as also confirmed by predictive performance, our proposed sparse, low-dimensional 
structure is a good fit to the data. 


45.56 
45.54 
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45.46 
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-73.6 -73.55 
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-73.6 -73.55 
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-73.5 


(a) Factor 1: populated residential areas to downtown (b) Factor 2: peripheral parts of downtown to central, 

popular parts 



(c) Factor 3: flow inside the central downtown 

Figure 3: Three of the factors identified in the BIXI dataset by our algorithm. Green plus signs show the 
stations, red crosses show the departure features and blue circles show the arrival features of a station chosen 
in the factor. 
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6.2. S&P 500 stocks 

We consider 294 companies from the S&P 500 [M] and collect their daily returns (percentage change in value 
from one day to the next) between March 1992 and December 2013 for a total of 5500 days. These returns 
are volatility adjusted using a GARCH model m and market adjusted by subtracting the market’s average 
return for each day. We use the global industry classification standard JT2j which categorizes all major public 
companies into 10 sectors. Since there are very few companies in the Telecom sector, we ignore this sector 
altogether. We divide the companies into two equal groups such that the number of companies from a specific 
sector is the same in each of the two groups. Our learning task is to predict the daily returns of the second 
group of companies from the first group. Although this is not a prediction task that would be of most interest 
in practice (where we want to predict future returns), it is a good test to examine the ability of an algorithm to 
extract the underlying factors and existing structure in the data. 

We divide the 5500 day period into 10 intervals of 550 days. For each interval, we choose the first 400 days as 
the training set (with the last 100 days as validation set) and the last 150 days as the test set. The average and 
standard deviation of MSE for SMFR is 128.8 (16.0), for SRRR is 129.2 (15.8), and for LASSO is 129.4 (17.1). 
There is minimal difference in the predictive performance of these algorithms on this dataset; however, we gain 
significant insight into the nature of data of by looking at the factors created by our algorithm. Figure [4] compares 
the resulting factors of SRRR and SMFR run over a period of 3000 days (this length is chosen to have clearer 
factors). We place companies from the same sectors next to each other in predictor and response matrices, and 
separate them with green lines. The sectors, from top to bottom, are Energy, Materials, Industrials, Consumer 
Discretionary, Consumer Staples, Health Care, Financials, IT, and Utilities. The proposed algorithm, SMFR, 
captures the sector factors to a good extent, whereas the structure of factors identified by SRRR is less clear. 




Factors 

(a) SMFR factors 


Factors 

(b) SRRR factors 



o 


- 0.1 


-0.2 



Figure 4: Factor matrices for SMFR and SRRR. 


6.3. Sparse PC A 

We compare our proposed fully sparse PCA with the well-known SPCA of Zou et al. introduced in m- 
We use our BIXI dataset again. Remember that there are 800 features in this dataset. We use the first 200 
data points of the dataset to simulate a high-dimensional setting. Thus, our data matrix, X, is 200 x 800. We 
compute the first 6 principal components and compare them using two metrics. First, we compute the adjusted 
explained variance; since the sparse principal components are not uncorrelated as in regular PCA, computing 
their explained variance separately is not correct. Before computing the explained variance of the fc’th principal 
component, regression projection is used to remove its linear dependence to components 1 to k — 1. See [F3] for 
more details on how to compute the adjusted explained variance. 

We also compare the loading sparsity. To have sparse components, we set the regularization parameters such 
that each component receives contributions from at most 10% of the features. As a benchmark, we also consider 
the simple thresholding where the values of the regular principal components with absolute value smaller than 
a threshold are set to zero (here, we keep the top 80). 

The results are summarized in Table [6] Compared to SPCA our algorithm explains more variance in the 
data (a total of 36.5% over the first 6 components compared to 18.6%) with sparser components. Also, we 
achieve a higher adjusted total variance compared to the simple thresholding (36.5% compared to 28.3%). 
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Adjusted Var (%) 

loading 

(IIA 

sparsity 

li,i) 

PC 

SMFR 

SPCA 

Thresholding 

SMFR 

SPCA 

1 

12.5 

8.7 

15.2 

78 

80 

2 

7.9 

2.9 

4.2 

45 

79 

3 

6.9 

2.0 

2.7 

70 

80 

4 

4.0 

1.9 

2.3 

62 

78 

5 

3.2 

1.7 

2.1 

58 

77 

6 

2.0 

1.4 

1.8 

35 

79 


Table 6: Adjusted explained variance and loading sparsity. 


7. Conclusion 

We introduced a new sparse multivariate regression algorithm which imposes a low-dimensional structure 
on the coefficient matrix by first decomposing it into the product of a long factor matrix and a wide loading 
matrix, with an elastic net penalty on the former and an t\ penalty on the latter. We also provided a for¬ 
mulation to infer the number of latent factors in a more effective way than current techniques. Although the 
problem formulation leads to a non-convex optimization problem, we showed convergence and optimality for 
an alternating minimization scheme (with three sets of updates). Through experiments on simulated and real 
datasets, we demonstrated that the proposed algorithm is able to exploit the existing structure in the data to 
improve predictive performance and model selection. 


Appendix A. Proof of Proposition 1 


Proof. For any given A and B, we have /(A, B) 
problems (23) and fl24|), the value of function / 
sequence of /(A, , Bj) converges to a limit value f* 


> 0. In both minimization steps of Algorithm 1 (i.e., 

is being decreased. Since / is bounded from below, the 
€ R. □ 


Appendix B. Proof of Theorem 3 

Proof of part (i) 


For problem (23), decomposing B into its columns, we can rewrite the minimization as follows: 

9 


B = argmin ^ <j -||Y« - XA i B (i) ||| + A 2 ||B«)|| 1 

B .7=1 


(B.l) 


where for any matrix, the superscript of (j) denotes its f th column. Therefore, problem (23) is equivalent to q 
separate Lasso problems, one for each response. 


Definition 3. A matrix X„ xp has its columns in general position if for any k < n, Xj ^ affjX^,..., Xj fc+1 }, Vj ^ 
{*i,..., ik+i}, where X, denotes the V th column of X and ‘aff’ denotes the affine span. 

hi [55] , Tibshirani shows that: 

Lemma 2 ( [36j . Lemmas 3 and 4). Assume that we have the following Lasso problem: 


min||y - Hb^ + A||bj|i. 
0 


If the columns of H are in general position, then for any y and A, the Lasso solution is unique with probability 
one. Moreover, if the entries of H € R nxm are drawn from a continuous probability distribution on R ram , then 
its columns are in general position and thus, for any y and X, the Lasso solution is unique with probability one. 


In problem (23), we have H = XA. If the elements of X are drawn from a continuous distribution, then its 
columns are in general position with probability one. In the following Lemma, we show that multiplying X by 
A with full column rank does not change this property and thus, the columns of H are also in general position. 
Therefore, according to Lemma |2j the solution of (23) is unique with probability one. 
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Lemma 3. If the columns ofX. nxp are in general position with probability one, and A pxm has full column rank, 
then the columns of XA are also in general position with probability one. 


Proof. Assume that for {* i, ..., ifc+i} and a j not in that set, we have XA, £ affjXA^ , ..., XAj fc+1 }. Then, 
for some ai,l = 1, + 1, we have: 


X (A j + ^ ' og Aj, j — 0. 


i=i 


(B.2) 


. . _ ... _I 1 

Since X has its columns in general position, (B.2) holds with a non-zero probability iff A, + 2Zi=i a iA-i t = 0, 
which is not possible because A has full column rank. □ 


Proof of part (ii) 


The objective function is strongly convex in A, so if there is a solution, it will be unique. 


Appendix C. Proof of Theorem 4 

Some of the proofs in this subsection exploit the biconvexity of the problem we are addressing and are based 
on the proofs of similar results in m- 


Proof of part (i) 

Definition 4. A is called the algorithmic map of Algorithm 1, if for C\ = (A^Bi) and C 2 = (A 2 ,B 2 ) we 
have: 

C 2 ed(Ci) iff /(A 1; B 2 ) < /(A 1; B),VB £ R mX9 
and /(A 2 ,B 2 )</(A,B 2 ),VAer xro . 


In other words, C 2 £ A(Ci) iff we can go from Ci to C 2 in one iteration of Algorithm 1. 
Lemma 4. The algorithmic map A is closed, i.e., we have: 


Cj — (Aj, Bj) and limj_,. 00 Cj — C 

C' £ A(Ci) and lim^ C' = C ^ > 

Proof. 

C' £ A( Q) =► /(Aj, B') < /(Aj, B), VB £ 

and/(A',B') < /(A,B(), VA £ K nxm 

Since / is continuous, we have: 

/(A*,B') = lim /(Aj, B') < lim /(A i; B) 

i—> oo i—t oo 

= /(A*,B) VB £ M. mxq 
/(A'.B') = lim /(A',B') < lim /(A,B() 

%—>oo i —>-oo 

= /( A.B') VA£R nxm 


(C.l) 


Thus, C' £ _4(C*), and A is closed. □ 

Lemma 5. For a given starting point, (Ao.Bq), the solutions {(Aj,Bj)}j g N stay in a bounded set. 

Proof. We have 

0 < AJAj|| M + A 2 ||Bj|| ljl + A 3 ||A||| < /(A,,Bj) < /(A 0 , B 0 ). (C.2) 

Thus, {(Aj, Bj)}j £ N stay in a bounded set. □ 

From Lemmas [4] and [5j we conclude that for a given starting point, the sequence of solutions {(A;,Bj)}j e N 
stays in a bounded, closed, and hence compact set and thus has at least one accumulation point. 
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Proof of part (ii) 

Since Q+i G A( C.j), we have: 


/(Aj,Bj + i) < /(Aj,B), VB G K mx? 
and /(Aj +1 , Bj + i) < /(A, B, +1 ), VA G R nxm 


Moreover, if we have f(C i+ i) = /(Cj), Then: 

/ (-A-i+i i Bt+i) = /(Aj,B i+1 ) = /(Aj,Bj 


(C.3) 


Therefore, if the solution of (23) is unique (i.e., Bj+i = B,), then Cj is a partial optimum, and if the solution 
of (241 is unique (i.e., Aj+i = Aj), then Cj+i is a partial optimum (the latter always hold because the solution 
of (24) is unique). 

We know that the sequence {C.ijjgpj has at least one accumulation point, say C*. Thus we have a convergent 
subsequence {C,}i g K with IcN that converges to C*. Similarly, {Cb + i}i e K has an accumulation point, say 
C + , to which a subsequence {Ci+i}jgL with Lcl converges. From Lemmajijwe get C + G A(C*), and using 
Proposition 1 we conclude /(C + ) = /(C*). Similarly, if C~ shows the accumulation point of {Ci_i}igK, we 
can show f(C~) = /(C*). 

Combining the results of these two paragraphs, if the solution of (23) is unique, /(C + ) = /(C*) implies that 
C* is partial optimum, and if the solution of (24) is unique, f(C~) = /(C*) implies that C* is partial optimum. 
Therefore, solution uniqueness of either (23) or (24) implies that C* is a partial optimum. 


Proof of part (Hi) 

We prove by contradiction; assume that ||Cj+i — Cj|| does not converge to zero. Then, for infinitely many 
i £ N, we have ||Ci_|_i — C*|| > 6 for a 5 > 0. Thus, denoting the accumulation points of sequences {Cj} ie isj and 
{C,+i respectively by C* and C + , we must have ||C* — C + || > <5 and hence C + ^ C*. On the other hand, 
since C* is a partial optimum and C + £ A{ C*), we have: 

/(A*, B*) = /(A*, B + ) = /(A+, B+). (C.4) 

Both A* and B* are full rank and thus, by Theorem 1, B + = B*, A + = A*, and consequently, C + = C*. This 
is in contradiction with the result of the previous paragraph and hence, ||Cj + i — C*|| converges to 0. 
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